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Detailed protocol 


To reduce noise in the data, we performed two-run clustering on each dataset. Examining the result from the first run clustering, we identified contamination 
clusters and clusters that arose from unwanted factors. The second run clustering without such noise prepared the basis for data integration across multiple 
datasets. Specifically, for each dataset from droplet-based technology, if a UMl-count table was available, we re-normalized the data using the R package 
SCTransform (version 0.3.2.9004), which is tailored for droplet-based scRNA-Seq data and much more robust to noise compared with the default 
normalization method of Seurat. Genes were ranked descendingly by variance (for TPM data) or residual variance estimated from the "vst" method 
implemented in the FindVariableFeatures function of Seurat (for CPM, SCTransform, or scran normalized data). Excluding genes in a blacklist (described 
later), the top 1500 genes were identified as highly variable genes (HVG). Then we ran the Seurat pipeline for the first time: regressed out cell cycle effect and 
donor effect; performed principal component analysis (PCA) using the highly variable genes; built a Shared Nearest Neighbor (SNN) Graph using the top 15 
principal components and clustered cells using the Louvain algorithm. For all data sets, the resolution parameter of clustering was set to 2.0, which could 
produce sufficient fine clustering according to our experiments. All other parameters were kept as the default values. After clustering, limma (version 3.42.2) 
was used for detecting differentially expressed genes. For each cluster, cells were compared with cells from all other clusters. Genes with log2 fold change 
larger than a specified threshold (1.00 and 0.25 for SmartSeq2-based datasets and droplet-based datasets respectively) and false discovery rate (FDR) < 
0.01 were defined as the marker genes of the cluster. Also, analysis of variance (ANOVA) was performed to obtain F value using R package aov. 

From the first run clustering, in multiple datasets, we identified clusters with marker genes associated with tissue dissociation operation including heat shock 
protein-encoding genes (50). Recurrent marker genes of these tissue dissociation-related clusters, i.e. those that showed significant high expression in more 
than 40% tissue dissociation-related clusters, were defined as disassociation induced genes (DIG) and were included in the gene blacklist after the first run 
clustering. Further, we found that for a few datasets, there were clusters showing signature of macrophage (high expression of "LYZ","C1QA","C1QB"), alv2 
cell (high expression of "SFTPC","SFTPA1","SFTPA2"), melanocyte (high expression of "GPNMB", "TYR", "PMEL", "MLPH") or B cell. The potential 
contamination or doublet cells were removed by dropping the clusters (for B cell) or by filtering out cells with high average expression of the signature genes of 
the contaminating cell types (macrophage, alv2 cell, and melanocyte). 

Then we ran the Seurat pipeline for the second time, with the gene blacklist including DIG. The unwanted effects caused by the cell cycle, donor, percentage of 
mitochondrial UMI counts, and DIG signature were removed by regression in this run. After the pipeline was completed successfully, the genes were ranked by 
F value from ANOVA in descent order, and then the rank values were converted to percentile rank values (by dividing by the total number of genes). The filtered 
gene expression matrix and percentile rank would be used in the downstream data integration. 

The gene blacklist contains immunoglobulin genes, T cell receptor (TCR) genes from R package biomaRt (version 2.42.1); ribosome-protein-coding genes 
(gene symbol with string pattern "*RP([0-9]+-|[LS])"), gene MALAT71 , proliferation genes, and dissociation induced genes from the first run of the Seurat 
pipeline. Processes including cell cycle and tissue dissociation may influence the expression of not only the associated signature genes, but also the whole 
transcriptome. To minimize the impact of those processes, the S phase score and G2M phase score were calculated with function Cell/CycleScoring , and the 
DIG score was calculated with function AddModuleScore . Then those scores were regressed out in the Seurat pipeline. 


The code implemented the pipeline could be found in github (hitos://github.com/Japrin/scPip). 
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